Abstract 



This paper considers the problem of completing a matrix with many missing entries under the assumption that 
the columns of the matrix belong to a union of multiple low-rank subspaces. This generalizes the standard low-rank 
matrix completion problem to situations in which the matrix rank can be quite high or even full rank. Since the 
columns belong to a union of subspaces, this problem may also be viewed as a missing-data version of the subspace 
clustering problem. Let X be an n x N matrix whose (complete) columns lie in a union of at most k subspaces, each 
of rank $\le r < n$, and assume $N \gg kn$. The main result of the paper shows that under mild assumptions each column 
of X can be perfectly recovered with high probability from an incomplete version so long as at least $C r N \log^2(n)$ 
entries of X are observed uniformly at random, with C > 1 a constant depending on the usual incoherence conditions, 
the geometrical arrangement of subspaces, and the distribution of columns over the subspaces. The result is illustrated 
with numerical experiments and an application to Internet distance matrix completion and topology identification. 

1 Introduction 

Consider a real-valued n x N dimensional matrix X. Assume that the columns of X lie in the union of at most k 
subspaces of $\mathbb{R}^n$, each having dimension at most r < n and assume that N > kn. We are especially interested 
in "high-rank" situations in which the total rank (the rank of the union of the subspaces) may be n. Our goal is to 
complete X based on an observation of a small random subset of its entries. We propose a novel method for this 
matrix completion problem. In the applications we have in mind N may be arbitrarily large, and so we will focus on 
quantifying the probability that a given column is perfectly completed, rather than the probability that the whole matrix is 
perfectly completed (i.e., every column is perfectly completed). Of course it is possible to translate between these two 
quantifications using a union bound, but that bound becomes meaningless if N is extremely large. 

Suppose the entries of X are observed uniformly at random with probability $p_0$. Let $\Omega$ denote the set of indices 
of observed entries and let $X_\Omega$ denote the observations of X. Our main result shows that under a mild set of 
assumptions each column of X can be perfectly recovered from $X_\Omega$ with high probability using a computationally 
efficient procedure if 

$$p_0 \ge C\,\frac{r}{n}\,\log^2(n), \qquad (1)$$

where C > 1 is a constant depending on the usual incoherence conditions as well as the geometrical arrangement of 
subspaces and the distribution of the columns in the subspaces. 

*The first two authors contributed equally to this paper. 
1.1 Connections to Low-Rank Completion 

Low-rank matrix completion theory [1] shows that an n x N matrix of rank r can be recovered from incomplete 
observations, as long as the number of entries observed (with locations sampled uniformly at random) exceeds $r N \log^2 N$ 
(within a constant factor and assuming $n \le N$). It is also known that, in the same setting, completion is impossible if 
the number of observed entries is less than a constant times $r N \log N$ [2]. These results imply that if the rank of X is 
close to n, then all of the entries are needed in order to determine the matrix. 

Here we consider a matrix whose columns lie in the union of at most k subspaces of $\mathbb{R}^n$. If the rank of 
each subspace is restricted to at most r, then the rank of the full matrix in our situation could be as large as kr, yielding 
the requirement $k r N \log^2 N$ using current matrix completion theory. In contrast, the bound in (1) implies that the 
completion of each column is possible from a constant times $r N \log^2 n$ entries sampled uniformly at random. Exact 
completion of every column can be guaranteed by replacing $\log^2 n$ with $\log^2 N$ in this bound, but since we allow N 
to be very large we prefer to state our result in terms of per-column completion. Our method, therefore, improves 
significantly upon conventional low-rank matrix completion, especially when k is large. This does not contradict the lower 
bound in [2], because the matrices we consider are not arbitrary high-rank matrices; rather, the columns must belong to 
a union of rank $\le r$ subspaces. 

1.2 Connections to Subspace Clustering 

Let $x_1, \dots, x_N \in \mathbb{R}^n$ and assume each $x_i$ lies in one of at most k subspaces of $\mathbb{R}^n$. Subspace clustering is the problem 
of learning the subspaces from $\{x_i\}_{i=1}^N$ and assigning each vector to its proper subspace; cf. [3] for an overview. This is 
a challenging problem, both in terms of computation and inference, but provably probably correct subspace clustering 
algorithms now exist [4, 5, 6]. Here we consider the problem of high rank matrix completion, which is essentially 
equivalent to subspace clustering with missing data. This problem has been looked at in previous works [7, 8], but 
to the best of our knowledge our method and theoretical bounds are novel. Note that our sampling probability bound 
(1) requires that only slightly more than r out of n entries are observed in each column, so the matrix may be highly 
incomplete. 

1.3 A Motivating Application 

There are many applications of subspace clustering, and it is reasonable to suppose that data may often be missing in 
high-dimensional problems. One such application is the Internet distance matrix completion and topology identifica- 
tion problem. Distances between networked devices can be measured in terms of hop-counts, the number of routers 
between the devices. Infrastructures exist that record distances from N end host computers to a set of n monitoring 
points throughout the Internet. The complete set of distances determines the network topology between the computers 
and the monitoring points [9]. These infrastructures are based entirely on passive monitoring of normal traffic. One 
advantage is the ability to monitor a very large portion of the Internet, which is not possible using active probing 
methods due to the burden they place on networks. The disadvantage of passive monitoring is that the measurements 
collected are based on normal traffic, which is not specifically designed or controlled, so a subset of the distances 
may not be observed. This poses a matrix completion problem, with the incomplete distance matrix being potentially 
full-rank in this application. However, computers tend to be clustered within subnets having a small number of egress 
(or access) points to the Internet at large. The number of egress points in a subnet limits the rank of the submatrix of 
distances from computers in the subnet to the monitors. Therefore the columns of the n x N distance matrix lie in 
the union of k low-rank subspaces, where k is the number of subnets. The solution to the matrix completion problem 
yields all the distances (and hence the topology) as well as the subnet clusters. 

1.4 Related Work 

The proof of the main result draws on ideas from matrix completion theory, subspace learning and detection with 
missing data, and subspace clustering. One key ingredient in our approach is the body of celebrated results on low-rank 
matrix completion [1, 2, 10]. Unfortunately, in many real-world problems where missing data is present, particularly when 
the data is generated from a union of subspaces, these matrices can have very large rank values (e.g., networking data 
in [11]). Thus, these prior results will require effectively all the elements be observed to accurately reconstruct the 
matrix. 

Our work builds upon the results of [12], which quantifies the deviation of an incomplete vector norm with respect 
to the incoherence of the sampling pattern. While this work also examines subspace detection using incomplete data, 
it assumes complete knowledge of the subspaces. 

While research that examines subspace learning has been presented in [13], the work in this paper differs by 
the concentration on learning from incomplete observations (i.e., when there are missing elements in the matrix), 
and by the methodological focus (i.e., nearest neighbor clustering versus a multiscale Singular Value Decomposition 
approach). 

1.5 Sketch of Methodology 

The algorithm proposed in this paper involves several relatively intuitive steps, outlined below. We go into detail for 
each of these steps in the following sections. 

Local Neighborhoods. A subset of columns of $X_\Omega$ are selected uniformly at random. These are called seeds. A 
set of nearest neighbors is identified for each seed from the remainder of $X_\Omega$. In Section 3 we show that nearest 
neighbors can be reliably identified, even though a large portion of the data are missing, under the usual incoherence 
assumptions. 

Local Subspaces. The subspace spanned by each seed and its neighborhood is identified using matrix completion. 
If matrix completion fails (i.e., if the resulting matrix does not agree with the observed entries and/or the rank of the 
result is greater than r), then the seed and its neighborhood are discarded. In Section 4 we show that when the number 
of seeds and the neighborhood sizes are large enough, then with high probability all k subspaces are identified. We 
may also identify additional subspaces which are unions of the true subspaces, which leads us to the next step. An 
example of these neighborhoods is shown in Figure 1. 

Subspace Refinement. The set of subspaces obtained from the matrix completions is pruned to remove all but k 
subspaces. The pruning is accomplished by simply discarding all subspaces that are spanned by the union of two or 
more other subspaces. This can be done efficiently, as is shown in Section 5. 

Full Matrix Completion. Each column in $X_\Omega$ is assigned to its proper subspace and completed by projection onto 
that subspace, as described in Section 6. Even when many observations are missing, it is possible to find the correct 
subspace and the projection using results from subspace detection with missing data [12]. The result of this step is a 
completed matrix $\hat{X}$ such that each column is correctly completed with high probability. 

The mathematical analysis will be presented in the next few sections, organized according to these steps. After proving 
the main result, experimental results are presented in the final section. 
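
As a rough illustration of the Subspace Refinement step sketched above, the following Python fragment discards any candidate subspace that is spanned by the union of two or more lower-dimensional candidates. It is a naive check, not the efficient procedure of Section 5: each candidate is assumed to be given as a list of basis vectors, and the names `rank`, `prune_candidates`, and the tolerance `tol` are hypothetical choices made for this sketch. Duplicate candidates describing the same subspace are not merged here.

```python
def rank(vectors, tol=1e-9):
    """Rank of a list of equal-length vectors via Gaussian elimination."""
    rows = [list(v) for v in vectors]
    ncols = len(rows[0]) if rows else 0
    rk = 0
    for col in range(ncols):
        # Choose the remaining row with the largest entry in this column.
        pivot = max(range(rk, len(rows)), key=lambda i: abs(rows[i][col]), default=None)
        if pivot is None or abs(rows[pivot][col]) <= tol:
            continue
        rows[rk], rows[pivot] = rows[pivot], rows[rk]
        for i in range(rk + 1, len(rows)):
            f = rows[i][col] / rows[rk][col]
            rows[i] = [a - f * b for a, b in zip(rows[i], rows[rk])]
        rk += 1
        if rk == len(rows):
            break
    return rk


def prune_candidates(candidates, tol=1e-9):
    """Return indices of candidates NOT spanned by two or more smaller candidates."""
    ranks = [rank(c, tol) for c in candidates]
    keep = []
    for i, S in enumerate(candidates):
        # Lower-dimensional candidates contained in S: adding them leaves rank(S) unchanged.
        inside = [T for j, T in enumerate(candidates)
                  if j != i and ranks[j] < ranks[i] and rank(S + T, tol) == ranks[i]]
        spanned = (len(inside) >= 2
                   and rank([v for T in inside for v in T], tol) == ranks[i])
        if not spanned:
            keep.append(i)
    return keep
```

For instance, with two coordinate lines in $\mathbb{R}^3$, the plane they span, and an unrelated plane as candidates, only the plane spanned by the two lines is removed.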




Figure 1: Example of nearest-neighborhood selection of points from a single subspace. For illustration, samples from three 
one-dimensional subspaces are depicted as small dots. The large dot is the seed. The subset of samples with significant observed 
support in common with that of the seed are depicted by *'s. If the density of points is high enough, then the nearest neighbors 
we identify will belong to the same subspace as the seed. In this case we depict the ball containing the 3 nearest neighbors of 
the seed with significant support overlap. 
2 Key Assumptions and Main Result 



The notion of incoherence plays a key role in matrix completion and subspace recovery from incomplete observations. 
Definition 1. The coherence of an r-dimensional subspace $S \subseteq \mathbb{R}^n$ is 

$$\mu(S) := \frac{n}{r} \max_j \|P_S e_j\|_2^2,$$

where $P_S$ is the projection operator onto S and $\{e_j\}$ are the canonical unit vectors for $\mathbb{R}^n$. 

Note that $1 \le \mu(S) \le n/r$. The coherence of a single vector $x \in \mathbb{R}^n$ is $\mu(x) = n\|x\|_\infty^2 / \|x\|_2^2$, which is precisely the 
coherence of the one-dimensional subspace spanned by x. With this definition, we can state the main assumptions we 
make about the matrix X. 
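
To make Definition 1 concrete, here is a minimal Python sketch of both coherence quantities. It assumes the subspace is supplied as a list of orthonormal basis vectors (the function does not orthonormalize), and the function names are illustrative rather than taken from the paper.

```python
def vector_coherence(x):
    """mu(x) = n * ||x||_inf^2 / ||x||_2^2 for a nonzero vector x."""
    n = len(x)
    sq_norm = sum(v * v for v in x)
    if sq_norm == 0:
        raise ValueError("coherence is undefined for the zero vector")
    return n * max(abs(v) for v in x) ** 2 / sq_norm


def subspace_coherence(basis):
    """mu(S) = (n/r) * max_j ||P_S e_j||_2^2.

    basis: r orthonormal vectors of length n spanning S (assumed, not checked).
    """
    r, n = len(basis), len(basis[0])
    # With an orthonormal basis, ||P_S e_j||^2 is the sum of squared j-th coordinates.
    row_energy = [sum(u[j] ** 2 for u in basis) for j in range(n)]
    return (n / r) * max(row_energy)
```

A flat vector such as (1, 1, 1, 1) attains the minimum coherence 1, while a canonical unit vector attains the maximum n/r = n for a one-dimensional subspace.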

A1. The columns of X lie in the union of at most k subspaces, with $k = o(n^d)$ for some $d > 0$. The subspaces are 
denoted by $S_1, \dots, S_k$ and each has rank at most $r < n$. The $\ell_2$-norm of each column is $\le 1$. 

A2. The coherence of each subspace is bounded above by $\mu_0$. The coherence of each column is bounded above by $\mu_1$ 
and for any pair of columns, $x_1$ and $x_2$, the coherence of $x_1 - x_2$ is also bounded above by $\mu_1$. 

A3. The columns of X do not lie in the intersection(s) of the subspaces with probability 1, and if rank($S_i$) = $r_i$, then 
any subset of $r_i$ columns from $S_i$ spans $S_i$ with probability 1. Let $0 < \epsilon_0 < 1$ and let $S_{i,\epsilon_0}$ denote the subset of 
points in $S_i$ at least $\epsilon_0$ distance away from any other subspace. There exists a constant $0 < \nu_0 \le 1$, depending 
on $\epsilon_0$, such that 

(i) The probability that a column selected uniformly at random belongs to $S_{i,\epsilon_0}$ is at least $\nu_0/k$. 

(ii) If $x \in S_{i,\epsilon_0}$, then the probability that a column selected uniformly at random belongs to the ball of radius 
$\epsilon_0$ centered at x is at least $\nu_0 \epsilon_0^r / k$. 

The conditions of A3 are met if, for example, the columns are drawn from a mixture of continuous distributions on 
each of the subspaces. The value of $\nu_0$ depends on the geometrical arrangement of the subspaces and the distribution 
of the columns within the subspaces. If the subspaces are not too close to each other, and the distributions within 
the subspaces are fairly uniform, then typically $\nu_0$ will be not too close to 0. We define three key quantities, the 
confidence parameter $\delta_0$, the required number of "seed" columns $s_0$, and a quantity $\ell_0$ related to the neighborhood 
formation process (see Algorithm 1 in Section 3): 

$$\delta_0 := n^{2-2\beta^{1/2}} \log n, \quad \text{for some } \beta > 1,$$

$$s_0 := \left\lceil \frac{k(\log k + \log 1/\delta_0)}{(1 - e^{-4})\,\nu_0} \right\rceil,$$

$$\ell_0 := \left\lceil \max\left\{ \frac{2k}{\nu_0 (\epsilon_0/\sqrt{3})^r},\; \frac{8k \log(s_0/\delta_0)}{n\,\nu_0 (\epsilon_0/\sqrt{3})^r} \right\} \right\rceil. \qquad (2)$$
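
For a feel of the magnitudes involved, the seed count $s_0$ and the neighborhood multiplier $\ell_0$ can be evaluated numerically. The sketch below is illustrative only: it takes $\delta_0$ as an input rather than deriving it, assumes natural logarithms, rounds both quantities up to integers, and uses a hypothetical function name.

```python
import math


def seed_and_neighborhood_sizes(n, k, r, nu0, eps0, delta0):
    """Return (s0, l0) as defined in (2), for a given confidence parameter delta0."""
    s0 = math.ceil(k * (math.log(k) + math.log(1 / delta0))
                   / ((1 - math.exp(-4)) * nu0))
    # nu0 * (eps0 / sqrt(3))^r is the common denominator of both terms in l0.
    ball = nu0 * (eps0 / math.sqrt(3)) ** r
    l0 = math.ceil(max(2 * k / ball,
                       8 * k * math.log(s0 / delta0) / (n * ball)))
    return s0, l0
```

With n = 1000, k = 5, r = 2, $\nu_0 = \epsilon_0 = 0.5$ and $\delta_0 = 10^{-3}$ this gives $s_0 = 87$ seeds and $\ell_0$ of about 240, so the first term of the maximum dominates in this regime.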



We can now state the main result of the paper. 



Theorem 2.1. Let X be an n x N matrix satisfying A1-A3. Suppose that each entry of X is observed independently 
with probability $p_0$. If 

$$p_0 \ge \frac{128\,\beta \max\{\mu_1^2, \mu_0\}}{\nu_0}\,\frac{r \log^2(n)}{n}$$

and 

$$N \ge \ell_0 n\,(2 s_0 \ell_0 n/\delta_0)^{2\mu_0^2 \log p_0^{-1}},$$

then each column of X can be perfectly recovered with probability at least $1 - (6 + 15 s_0)\,\delta_0$, using the methodology 
sketched above (and detailed later in the paper). 
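
The sampling condition of Theorem 2.1 is easy to evaluate numerically. The following sketch simply computes its right-hand side, assuming natural logarithms; the function name and the parameter values in the note below are illustrative choices, not values from the paper.

```python
import math


def min_sampling_probability(n, r, beta, mu0, mu1, nu0):
    """Right-hand side of the condition on p0 in Theorem 2.1."""
    return 128 * beta * max(mu1 ** 2, mu0) / nu0 * r * math.log(n) ** 2 / n
```

With r = 2, $\beta = 1.5$ and $\mu_0 = \mu_1 = \nu_0 = 1$, the bound exceeds 1 at n = 1000 (so it is vacuous there) and is about $1.6 \times 10^{-4}$ at $n = 10^9$. The explicit constants are conservative, and the condition is informative mainly through its scaling in r, n and $\log^2 n$.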
The requirements on sampling are essentially the same as those for standard low-rank matrix completion, apart 
from the requirement that the total number of columns N is sufficiently large. This is needed to ensure that each of the 
subspaces is sufficiently represented in the matrix. The requirement on N is polynomial in n for fixed $p_0$, which is 
easy to see based on the definitions of $\delta_0$, $s_0$, and $\ell_0$ (see further discussion at the end of Section 3). 

Perfect recovery of each column is guaranteed with probability that decreases linearly in $s_0$, which itself is linear 
in k (ignoring log factors). This is expected since this problem is more difficult than k individual low-rank matrix 
completions. We state our results in terms of a per-column (rather than full matrix) recovery guarantee. A full matrix 
recovery guarantee can be given by replacing $\log^2 n$ with $\log^2 N$. This is evident from the final completion step 
discussed in Lemma 8 below. However, since N may be quite large (perhaps arbitrarily large) in the applications we 
envision, we chose to state our results in terms of a per-column guarantee. 

The details of the methodology and lemmas leading to the theorem above are developed in the subsequent sections 
following the four steps of the methodology outlined above. In certain cases it will be more convenient to consider 
sampling the locations of observed entries uniformly at random with replacement rather than without replacement, as 
assumed above. The following lemma will be useful for translating bounds derived assuming sampling with 
replacement to our situation (the same sort of relation is noted in Proposition 3.1 in [1]). 

Lemma 1. Draw m samples independently and uniformly from $\{1, \dots, n\}$ and let $\Omega'$ denote the resulting subset of 
unique values. Let $\Omega_m$ be a subset of size m selected uniformly at random from $\{1, \dots, n\}$. Let E denote an event 
depending on a random subset of $\{1, \dots, n\}$. If $P(E(\Omega_m))$ is a non-increasing function of m, then 
$P(E(\Omega')) \ge P(E(\Omega_m))$. 

Proof. For $k = 1, \dots, m$, let $\Omega_k$ denote a subset of size k sampled uniformly at random from $\{1, \dots, n\}$, and let 
$m' = |\Omega'|$. Then 

$$P(E(\Omega')) = \sum_{k=0}^{m} P(E(\Omega') \mid m' = k)\,P(m' = k)$$

$$= \sum_{k=0}^{m} P(E(\Omega_k))\,P(m' = k)$$

$$\ge P(E(\Omega_m)) \sum_{k=0}^{m} P(m' = k).$$

□ 



3 Local Neighborhoods 

In this first step, s columns of $X_\Omega$ are selected uniformly at random and a set of "nearby" columns are identified for 
each, constituting a local neighborhood of size n. All bounds are designed to hold with probability at least $1 - \delta_0$, 
where $\delta_0$ is defined in (2) above. The s columns are called "seeds." The required size of s is determined as follows. 

Lemma 2. Assume A3 holds. If the number of chosen seeds, 

$$s \ge \frac{k(\log k + \log 1/\delta_0)}{(1 - e^{-4})\,\nu_0},$$

then with probability greater than $1 - \delta_0$, for each $i = 1, \dots, k$, at least one seed is in $S_{i,\epsilon_0}$ and each seed column has 
at least 

$$\eta_0 := \frac{64\,\beta \max\{\mu_1^2, \mu_0\}}{\nu_0}\, r \log^2(n) \qquad (3)$$

observed entries. 

Proof. First note that from Theorem 2.1, the expected number of observed entries per column is at least 

$$\eta = \frac{128\,\beta \max\{\mu_1^2, \mu_0\}}{\nu_0}\, r \log^2(n).$$

Therefore, the number of observed entries $\hat{\eta}$ in a column selected uniformly at random is probably not significantly 
less. More precisely, by Chernoff's bound we have 

$$P(\hat{\eta} \le \eta/2) \le \exp(-\eta/8) < e^{-4}.$$

Combining this with A3, we have the probability that a randomly selected column belongs to $S_{i,\epsilon_0}$ and has $\eta/2$ or 
more observed entries is at least $\nu_0'/k$, where $\nu_0' := (1 - e^{-4})\,\nu_0$. Then, the probability that the set of s columns does 
not contain a column from $S_{i,\epsilon_0}$ with at least $\eta/2$ observed entries is less than $(1 - \nu_0'/k)^s$. The probability that the 
set does not contain at least one column from $S_{i,\epsilon_0}$ with $\eta/2$ or more observed entries, for $i = 1, \dots, k$, is less than 
$\delta_0 = k(1 - \nu_0'/k)^s$. Solving for s in terms of $\delta_0$ yields 

$$s = \frac{\log k + \log 1/\delta_0}{\log\left(\frac{k/\nu_0'}{k/\nu_0' - 1}\right)}.$$

The result follows by noting that $\log(x/(x - 1)) \ge 1/x$, for $x > 1$. 

□ 



Next, for each seed we must find a set of n columns from the same subspace as the seed. This will be accomplished 
by identifying columns that are $\epsilon_0$-close to the seed, so that if the seed belongs to $S_{i,\epsilon_0}$, the columns must belong to 
the same subspace. Clearly the total number of columns N must be sufficiently large so that n or more such columns 
can be found. We will return to the requirement on N a bit later, after first dealing with the following challenge. 

Since the columns are only partially observed, it may not be possible to determine how close each is to the seed. 
We address this by showing that if a column and the seed are both observed on enough common indices, then the 
incoherence assumption A2 allows us to reliably estimate the distance. 

Lemma 3. Assume A2 and let $y = x_1 - x_2$, where $x_1$ and $x_2$ are two columns of X. Assume there is a common set of 
indices of size $q \le n$ where both $x_1$ and $x_2$ are observed. Let $\omega$ denote this common set of indices and let $y_\omega$ denote 
the corresponding subset of y. Then for any $\delta_0 > 0$, if the number of commonly observed elements 

$$q \ge 8\mu_1^2 \log(2/\delta_0),$$

then with probability at least $1 - \delta_0$ 

$$\frac{1}{2}\|y\|_2^2 \le \frac{n}{q}\|y_\omega\|_2^2 \le \frac{3}{2}\|y\|_2^2.$$

Proof. Note that $\|y_\omega\|_2^2$ is the sum of q random variables drawn uniformly at random without replacement from the set 
$\{y_1^2, \dots, y_n^2\}$, and $E\|y_\omega\|_2^2 = \frac{q}{n}\|y\|_2^2$. We will prove the bound under the assumption that, instead, the q variables 
are sampled with replacement, so that they are independent. By Lemma 1, this will provide the desired result. Note 
that if one variable in the sum $\|y_\omega\|_2^2$ is replaced with another value, then the sum changes in value by at most $2\|y\|_\infty^2$. 
Therefore, McDiarmid's Inequality shows that for $t > 0$ 

$$P\left( \left| \|y_\omega\|_2^2 - \frac{q}{n}\|y\|_2^2 \right| \ge t \right) \le 2\exp\left( \frac{-t^2}{2q\|y\|_\infty^4} \right),$$

or equivalently 

$$P\left( \left| \frac{n}{q}\|y_\omega\|_2^2 - \|y\|_2^2 \right| \ge t \right) \le 2\exp\left( \frac{-q t^2}{2n^2\|y\|_\infty^4} \right).$$

Assumption A2 implies that $n^2\|y\|_\infty^4 \le \mu_1^2\|y\|_2^4$, and so we have 

$$P\left( \left| \frac{n}{q}\|y_\omega\|_2^2 - \|y\|_2^2 \right| \ge t \right) \le 2\exp\left( \frac{-q t^2}{2\mu_1^2\|y\|_2^4} \right).$$

Taking $t = \frac{1}{2}\|y\|_2^2$ yields the result. 

□ 
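
The quantity controlled by Lemma 3 is the rescaled partial distance $\frac{n}{q}\|y_\omega\|_2^2$. A minimal Python sketch of how it might be computed is given below, assuming missing entries are encoded as `None` (a convention of this sketch, as are the function names) and natural logarithms in the overlap requirement.

```python
import math


def min_common_entries(mu1, delta0):
    """Smallest integer q with q >= 8 * mu1^2 * log(2 / delta0), as in Lemma 3."""
    return math.ceil(8 * mu1 ** 2 * math.log(2 / delta0))


def partial_sq_distance(x1, x2):
    """Return ((n/q) * ||y_w||_2^2, q), where w is the set of commonly observed indices."""
    n = len(x1)
    common = [i for i in range(n) if x1[i] is not None and x2[i] is not None]
    q = len(common)
    if q == 0:
        raise ValueError("no commonly observed entries")
    return (n / q) * sum((x1[i] - x2[i]) ** 2 for i in common), q
```

When both columns are fully observed the estimate reduces to the exact squared distance. With $\mu_1 = 1$ and $\delta_0 = 0.01$ the lemma asks for at least 43 common entries.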



Suppose that $x_1 \in S_{i,\epsilon_0}$ (for some i) and that $x_2 \notin S_i$, and that both $x_1, x_2$ observe $q \ge 2\mu_0^2 \log(2/\delta_0)$ common 
indices. Let $y_\omega$ denote the difference between $x_1$ and $x_2$ on the common support set. If the partial distance 
$\frac{n}{q}\|y_\omega\|_2^2 \le \epsilon_0^2/2$, then the result above implies that with probability at least $1 - \delta_0$ 

$$\|x_1 - x_2\|_2^2 \le 2\,\frac{n}{q}\|y_\omega\|_2^2 \le \epsilon_0^2.$$

On the other hand if $x_2 \in S_i$ and $\|x_1 - x_2\|_2^2 \le \epsilon_0^2/3$, then with probability at least $1 - \delta_0$ 

$$\frac{n}{q}\|y_\omega\|_2^2 \le \frac{3}{2}\|x_1 - x_2\|_2^2 \le \epsilon_0^2/2.$$


Using these results we will proceed as follows. For each seed we find all columns that have at least $t_0 > 2\mu_0^2 \log(2/\delta_0)$ 
observations at indices in common with the seed (the precise value of $t_0$ will be specified in a moment). Assuming 
that this set is sufficiently large, we will select $\ell n$ of these columns uniformly at random, for some integer $\ell \ge 1$. In 
particular, $\ell$ will be chosen so that with high probability at least n of the columns will be within $\epsilon_0/\sqrt{3}$ of the seed, 
ensuring that with probability at least $1 - \delta_0$ the corresponding partial distance of each will be within $\epsilon_0/\sqrt{2}$. That is 
enough to guarantee with the same probability that the columns are within $\epsilon_0$ of the seed. Of course, a union bound 
will be needed so that the distance bounds above hold uniformly over the set of $s \ell n$ columns under consideration, 
which means that we will need each to have at least $t_0 := 2\mu_0^2 \log(2 s \ell n/\delta_0)$ observations at indices in common with 
the corresponding seed. All this is predicated on N being large enough so that such columns exist in $X_\Omega$. We will 
return to this issue later, after determining the requirement for $\ell$. For now we will simply assume that $N \ge \ell n$. 

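Lemma 4. Assume A3 and for each seed x let $T_{x,\epsilon_0}$ denote the number of columns of X in the ball of radius $\epsilon_0/\sqrt{3}$ 
about x. If the number of columns selected for each seed, $\ell n$, is such that 

$$\ell \ge \max\left\{ \frac{2k}{\nu_0 (\epsilon_0/\sqrt{3})^r},\; \frac{8k \log(s/\delta_0)}{n\,\nu_0 (\epsilon_0/\sqrt{3})^r} \right\},$$

then $P(T_{x,\epsilon_0} \le n) \le \delta_0$ for all s seeds. 

Proof. The probability that a column chosen uniformly at random from X belongs to this ball is at least $\nu_0 (\epsilon_0/\sqrt{3})^r / k$, 
by Assumption A3. Therefore the expected number of points is 

$$E[T_{x,\epsilon_0}] \ge \frac{\ell n\,\nu_0 (\epsilon_0/\sqrt{3})^r}{k}.$$

By Chernoff's bound, for any $0 < \gamma < 1$ 

$$P\left( T_{x,\epsilon_0} \le (1 - \gamma)\,\frac{\ell n\,\nu_0 (\epsilon_0/\sqrt{3})^r}{k} \right) \le \exp\left( -\frac{\gamma^2}{2}\,\frac{\ell n\,\nu_0 (\epsilon_0/\sqrt{3})^r}{k} \right).$$

Take $\gamma = 1/2$, which yields 

$$P\left( T_{x,\epsilon_0} \le \frac{\ell n\,\nu_0 (\epsilon_0/\sqrt{3})^r}{2k} \right) \le \exp\left( -\frac{\ell n\,\nu_0 (\epsilon_0/\sqrt{3})^r}{8k} \right).$$

We would like to choose $\ell$ so that $\frac{\ell n\,\nu_0 (\epsilon_0/\sqrt{3})^r}{2k} \ge n$ and so that $\exp\left( -\frac{\ell n\,\nu_0 (\epsilon_0/\sqrt{3})^r}{8k} \right) \le \delta_0/s$ (so that the probability 
that the desired result fails for one or more of the s seeds is less than $\delta_0$). The first condition leads to the requirement 
$\ell \ge \frac{2k}{\nu_0 (\epsilon_0/\sqrt{3})^r}$. The second condition produces the requirement $\ell \ge \frac{8k \log(s/\delta_0)}{n\,\nu_0 (\epsilon_0/\sqrt{3})^r}$. □ 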

We can now formally state the procedure for finding local neighborhoods in Algorithm 1. Recall that the number 
of observed entries in each seed is at least $\eta_0$, per Lemma 2. 

Lemma 5. If N is sufficiently large and $\eta_0 > t_0$, then the Local Neighborhood Procedure in Algorithm 1 produces 
at least n columns within $\epsilon_0$ of each seed, and at least one seed will belong to each of $S_{i,\epsilon_0}$, for $i = 1, \dots, k$, with 
probability at least $1 - 3\delta_0$. 

Proof. Lemma 2 states that if we select $s_0$ seeds, then with probability at least $1 - \delta_0$ there is a seed in each $S_{i,\epsilon_0}$, 
$i = 1, \dots, k$, with at least $\eta_0$ observed entries, where $\eta_0$ is defined in (3). Lemma 4 implies that if $\ell_0 n$ columns are 
selected uniformly at random for each seed, then with probability at least $1 - \delta_0$ for each seed at least n of the columns 
will be within a distance $\epsilon_0/\sqrt{3}$ of the seed. Each seed has at least $\eta_0$ observed entries and we need to find $\ell_0 n$ other 
columns with at least $t_0$ observations at indices where the seed was observed. Provided that $\eta_0 > t_0$, this is certainly 
possible if N is large enough. It follows from Lemma 3 that if $\ell_0 n$ columns have at least $t_0$ observations at indices 
where the seed was also observed, then with probability at least $1 - \delta_0$ the partial distances will be within $\epsilon_0/\sqrt{2}$, 
which implies the true distances are within $\epsilon_0$. The result follows by the union bound. □ 

Algorithm 1 - Local Neighborhood Procedure 

Input: $n, k, \mu_0, \epsilon_0, \nu_0, \eta_0, \delta_0 > 0$. 

$$s_0 := \left\lceil \frac{k(\log k + \log 1/\delta_0)}{(1 - e^{-4})\,\nu_0} \right\rceil$$

$$\ell_0 := \left\lceil \max\left\{ \frac{2k}{\nu_0 (\epsilon_0/\sqrt{3})^r},\; \frac{8k \log(s_0/\delta_0)}{n\,\nu_0 (\epsilon_0/\sqrt{3})^r} \right\} \right\rceil$$

$$t_0 := \left\lceil 2\mu_0^2 \log(2 s_0 \ell_0 n/\delta_0) \right\rceil$$

Steps: 

1. Select $s_0$ "seed" columns uniformly at random and discard all with less than $\eta_0$ observations 

2. For each seed, find all columns with $t_0$ observations at locations observed in the seed 

3. Randomly select $\ell_0 n$ columns from each such set 

4. Form local neighborhood for each seed by randomly selecting n columns with partial distance less than $\epsilon_0/\sqrt{2}$ 
from the seed 
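
The four steps of Algorithm 1 can be sketched in plain Python as follows. This is an illustrative toy implementation under stated assumptions, not the authors' code: columns are lists with `None` marking unobserved entries, the parameters $s_0$, $\ell_0$, $t_0$, $\eta_0$, $\epsilon_0$ are passed in directly, a seed that ends up with fewer than n close columns is simply dropped, and all function names are hypothetical.

```python
import random


def overlap_and_distance(x, y):
    """Return (q, (n/q) * ||y_w||^2) over entries observed in both columns."""
    n = len(x)
    common = [i for i in range(n) if x[i] is not None and y[i] is not None]
    q = len(common)
    if q == 0:
        return 0, None
    return q, (n / q) * sum((x[i] - y[i]) ** 2 for i in common)


def local_neighborhoods(columns, s0, l0, t0, eta0, eps0, rng):
    """Map each surviving seed index to the indices of its n selected neighbors."""
    N, n = len(columns), len(columns[0])
    # Step 1: draw seeds and discard those with fewer than eta0 observations.
    seeds = [j for j in rng.sample(range(N), min(s0, N))
             if sum(v is not None for v in columns[j]) >= eta0]
    result = {}
    for j in seeds:
        # Step 2: columns sharing at least t0 observed indices with the seed.
        stats = {c: overlap_and_distance(columns[j], columns[c])
                 for c in range(N) if c != j}
        pool = [c for c, (q, _) in stats.items() if q >= max(t0, 1)]
        # Step 3: a random subset of (at most) l0 * n of those columns.
        picked = rng.sample(pool, min(l0 * n, len(pool)))
        # Step 4: keep columns with partial distance below eps0 / sqrt(2),
        # i.e. rescaled squared distance below eps0^2 / 2.
        close = [c for c in picked if stats[c][1] < eps0 ** 2 / 2]
        if len(close) >= n:  # assumption of this sketch: otherwise drop the seed
            result[j] = rng.sample(close, n)
    return result
```

Passing an explicit `random.Random` instance keeps the selection reproducible. On a toy input with two well-separated one-dimensional subspaces, every neighborhood returned contains only columns from the seed's own subspace.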

Finally, we quantify just how large N needs to be. Lemma 4 also shows that we require at least 

$$N \ge \ell n \ge \max\left\{ \frac{2kn}{\nu_0 (\epsilon_0/\sqrt{3})^r},\; \frac{8k \log(s/\delta_0)}{\nu_0 (\epsilon_0/\sqrt{3})^r} \right\}.$$

However, we must also determine a lower bound on the probability that a column selected uniformly at random has at 
least $t_0$ observed indices in common with a seed. Let $\gamma_0$ denote this probability, and let $p_0$ denote the probability of 
observing each entry in X. Note that our main result, Theorem 2.1, assumes that 

$$p_0 \ge \frac{128\,\beta \max\{\mu_1^2, \mu_0\}}{\nu_0}\,\frac{r \log^2(n)}{n}.$$

Since each seed has at least $\eta_0$ entries observed, $\gamma_0$ is greater than or equal to the probability that a Binomial($\eta_0$, $p_0$) 
random variable is at least $t_0$. Thus, 

$$\gamma_0 \ge \sum_{j=t_0}^{\eta_0} \binom{\eta_0}{j} p_0^j (1 - p_0)^{\eta_0 - j}.$$

This implies that the expected number of columns with $t_0$ or more observed indices in common with a seed is at least 
$\gamma_0 N$. If $\tilde{n}$ is the actual number with this property, then by Chernoff's bound, $P(\tilde{n} \le \gamma_0 N/2) \le \exp(-\gamma_0 N/8)$. So 
$N \ge 2\ell_0 \gamma_0^{-1} n$ will suffice to guarantee that enough columns can be found for each seed with probability at least 
$1 - s_0 \exp(-\ell_0 n/4) \ge 1 - \delta_0$; in fact this probability will be far larger than $1 - \delta_0$, since $\delta_0$ is polynomial in n. 

To take this a step further, a simple lower bound on $\gamma_0$ is obtained as follows. Suppose we consider only a $t_0$-sized 
subset of the indices where the seed is observed. The probability that another column selected at random is observed 
at all $t_0$ indices in this subset is $p_0^{t_0}$. Clearly $\gamma_0 \ge p_0^{t_0} = \exp(t_0 \log p_0) \ge (2 s_0 \ell_0 n/\delta_0)^{2\mu_0^2 \log p_0}$. This yields the following 
sufficient condition on the size of N: 

$$N \ge \ell_0 n\,(2 s_0 \ell_0 n/\delta_0)^{2\mu_0^2 \log p_0^{-1}}.$$






From the definitions of $s_0$ and $\ell_0$, this implies that if $2\mu_0^2 \log p_0$ is a fixed constant, then a sufficient number of 
columns will exist if $N = O(\mathrm{poly}(kn/\delta_0))$. For example, if $\mu_0^2 = 1$ and $p_0 = 1/2$, then $N = O(((kn)/\delta_0)^{2.4})$ 
will suffice; i.e., N need only grow polynomially in n. On the other hand, in the extremely undersampled case $p_0$ 
scales like $\log^2(n)/n$ (as n grows and r and k stay constant) and N will need to grow almost exponentially in n, like 
$n^{\log n - 2\log\log n}$. 
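
The two quantities in this discussion, the binomial-tail lower bound on $\gamma_0$ and the sufficient number of columns, can be evaluated with a few lines of Python. This is a sketch assuming natural logarithms; the function names are illustrative and not part of the paper.

```python
import math


def overlap_probability(eta0, t0, p0):
    """P(Binomial(eta0, p0) >= t0), the lower bound on gamma_0 used above."""
    return sum(math.comb(eta0, j) * p0 ** j * (1 - p0) ** (eta0 - j)
               for j in range(t0, eta0 + 1))


def sufficient_num_columns(n, s0, l0, mu0, p0, delta0):
    """Evaluate l0 * n * (2 * s0 * l0 * n / delta0) ** (2 * mu0^2 * log(1 / p0))."""
    exponent = 2 * mu0 ** 2 * math.log(1 / p0)
    return l0 * n * (2 * s0 * l0 * n / delta0) ** exponent
```

For $\mu_0^2 = 1$ and $p_0 = 1/2$ the exponent is $2\log 2 \approx 1.386$, which together with the leading factor $\ell_0 n$ accounts for the overall power of roughly 2.4 quoted above. The cruder bound $p_0^{t_0}$ is typically far smaller than the binomial tail: for $\eta_0 = 20$, $t_0 = 5$, $p_0 = 1/2$ the tail is about 0.994, while $p_0^{t_0} = 1/32$.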



4 Local Subspace Completion 

For each of our local neighbor sets, we will have an incompletely observed n x n matrix; if all the neighbors belong to 
a single subspace, the matrix will have rank $\le r$. First, we recall the following result from low-rank matrix completion 
theory [1]. 

Lemma 6. Consider an n x n matrix of rank $\le r$ and row and column spaces with coherences bounded above by 
some constant $\mu_0$. Then the matrix can be exactly completed if 

$$m' \ge 64 \max(\mu_1^2, \mu_0)\,\beta\, r n \log^2(2n) \qquad (4)$$

entries are observed uniformly at random, for constants $\beta > 0$ and with probability $\ge 1 - 6(2n)^{2-2\beta} \log n - n^{2-2\beta^{1/2}}$. 

We wish to apply these results to our local neighbor sets, but we have three issues we must address: First, the 
sampling of the matrices formed by local neighborhood sets is not uniform since the set is selected based on the 
observed indices of the seed. Second, given Lemma 2 we must complete not one, but $s_0$ (see Algorithm 1) incomplete 
matrices simultaneously with high probability. Third, some of the local neighbor sets may have columns from more 
than one subspace. Let us consider each issue separately. 

First consider the fact that our incomplete submatrices are not sampled uniformly. The non-uniformity can be 
corrected with a simple thinning procedure. Recall that the columns in the seed's local neighborhood are identified 
first by finding columns with sufficient overlap with each seed's observations. To refer to the seed's observations, we 
will say "the support of the seed." 

Due to this selection of columns, the resulting neighborhood columns are highly sampled on the support of the 
seed. In fact, if we again use the notation q for the minimum overlap between two columns needed to calculate 
distance, then these columns have at least q observations on the support of the seed. Off the support, these columns 
are still sampled uniformly at random with the same probability as the entire matrix. Therefore we focus only on 
correcting the sampling pattern on the support of the seed. 

Let t be the cardinality of the support of a particular seed. Because all entries of the entire matrix are sampled 
independently with probability $p_0$, then for a randomly selected column, the random variable which generates t is 
binomial. For neighbors selected to have at least q overlap with a particular seed, we denote by t' the number of 
samples overlapping with the support of the seed. The probability density for t' is positive only for $j = q, \dots, t$, 

$$P(t' = j) = \frac{\binom{t}{j} p_0^j (1 - p_0)^{t-j}}{\rho},$$

where $\rho = \sum_{j=q}^{t} \binom{t}{j} p_0^j (1 - p_0)^{t-j}$. 

In order to thin the common support, we need two new random variables. The first is a Bernoulli, call it Y, which 
takes the value 1 with probability $\rho$ and 0 with probability $1 - \rho$. The second random variable, call it Z, takes values 
$j = 0, \dots, q - 1$ with probability 

$$P(Z = j) = \frac{\binom{t}{j} p_0^j (1 - p_0)^{t-j}}{1 - \rho}.$$

Define $t'' = t'Y + Z(1 - Y)$. The density of t'' is 

$$P(t'' = j) = \begin{cases} P(Z = j)\,(1 - \rho), & j = 0, \dots, q - 1, \\ P(t' = j)\,\rho, & j = q, \dots, t, \end{cases}$$

which is equal to the desired binomial distribution. Thus, the thinning is accomplished as follows. For each column draw 
an independent sample of Y. If the sample is 1, then the column is not altered. If the sample is 0, then a realization of 
Z is drawn, which we denote by z. Select a random subset of size z from the observed entries in the seed support and 
discard the remainder. We note that the seed itself should not be used in completion, because there is a dependence 
between the sample locations of the seed column and its selected neighbors which cannot be eliminated. 
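
The thinning step can be sketched in Python as below. The code assumes that `observed` lists the indices, within the seed support, at which a neighbor column is observed (so its length plays the role of t' and is at least q); the function names are illustrative. `thinned_pmf` evaluates the density of t'' case by case so that it can be compared with the Binomial(t, $p_0$) density.

```python
import math
import random


def binom_pmf(t, j, p0):
    """P(Binomial(t, p0) = j)."""
    return math.comb(t, j) * p0 ** j * (1 - p0) ** (t - j)


def tail_mass(t, q, p0):
    """rho = P(Binomial(t, p0) >= q)."""
    return sum(binom_pmf(t, j, p0) for j in range(q, t + 1))


def thinned_pmf(t, q, p0):
    """Density of t'' = t'Y + Z(1 - Y), assembled from the two cases in the text."""
    rho = tail_mass(t, q, p0)
    pmf = []
    for j in range(t + 1):
        if j < q:
            pmf.append(binom_pmf(t, j, p0) / (1 - rho) * (1 - rho))  # P(Z = j)(1 - rho)
        else:
            pmf.append(binom_pmf(t, j, p0) / rho * rho)              # P(t' = j) rho
    return pmf


def thin_support(observed, t, q, p0, rng):
    """Return the observed indices on the seed support that are kept after thinning."""
    rho = tail_mass(t, q, p0)
    if q == 0 or rng.random() < rho:       # Y = 1: leave the column unchanged
        return list(observed)
    # Y = 0: draw Z on {0, ..., q-1} with probability proportional to the binomial pmf.
    weights = [binom_pmf(t, j, p0) for j in range(q)]
    z = rng.choices(range(q), weights=weights)[0]
    return rng.sample(list(observed), z)   # keep a random subset of size z
```

After thinning, a column either keeps all of its observations on the seed support or keeps fewer than q of them, which is what restores the binomial distribution of the overlap.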
Now after thinning, we have the following matrix completion guarantee for each neighborhood matrix. 

Lemma 7. Assume all $s_0$ seed neighborhood matrices are thinned according to the discussion above, have rank $\le r$, 
and the matrix entries are observed uniformly at random with probability 

$$p_0 \ge \frac{128\,\beta \max\{\mu_1^2, \mu_0\}}{\nu_0}\,\frac{r \log^2(n)}{n}.$$

Then with probability $\ge 1 - 12 s_0 n^{2-2\beta^{1/2}} \log n$, all $s_0$ matrices can be perfectly completed. 

Proof. First, we find that if each matrix has 

$$m' \ge 64 \max(\mu_1^2, \mu_0)\,\beta\, r n \log^2(2n)$$

entries observed uniformly at random (with replacement), then with probability $\ge 1 - 12 s_0 n^{2-2\beta^{1/2}} \log n$, all $s_0$ 
matrices are perfectly completed. This follows by Lemma 6, the observation that 

$$6(2n)^{2-2\beta} \log n + n^{2-2\beta^{1/2}} \le 12 n^{2-2\beta^{1/2}} \log n,$$

and a simple application of the union bound. 

But, under our sampling assumptions, the number of entries observed in each seed neighborhood matrix is random. 
Thus, the total number of observed entries in each is guaranteed to be sufficiently large with high probability as follows. 
The random number of entries observed in an n x n matrix is $\hat{m} \sim$ Binomial($p_0$, $n^2$). By Chernoff's bound we have 
$P(\hat{m} \le n^2 p_0/2) \le \exp(-n^2 p_0/8)$. By the union bound we find that $\hat{m} \ge m'$ entries are observed in each of the $s_0$ 
seed matrices with probability at least $1 - \exp(-n^2 p_0/8 + \log s_0)$ if $p_0 \ge \frac{128\,\beta \max\{\mu_1^2, \mu_0\}}{\nu_0}\,\frac{r \log^2(n)}{n}$. 

Since $n^2 p_0 > r n \log^2 n$ and $s_0 = O(k(\log k + \log n))$, the failure probability $\exp(-n^2 p_0/8 + \log s_0)$ tends to zero 
exponentially in n as long as $k = o(e^n)$, which holds according to Assumption A1. Therefore the claim holds with 
probability at least $1 - 12 s_0 n^{2-2\beta^{1/2}} \log n$. □ 

Finally, let us consider the third issue, the possibility that one or more of the points in the neighborhood of a seed 
lies in a subspace different than the seed subspace. When this occurs, the rank of the submatrix formed by the seed's 
neighbor columns will be larger than the dimension of the seed subspace. Without loss of generality assume that we 
have only two subspaces represented in the neighbor set, and assume their dimensions are r' and r''. First, in the case 
that r' + r'' > r, when a rank > r matrix is completed to a rank r matrix, with overwhelming probability there will 
be errors with respect to the observations as long as the number of samples in each column is O(r log r), which is 
assumed in our case; see [12]. Thus we can detect and discard these candidates. Secondly, in the case that $r' + r'' \le r$, 
we still have enough samples to complete this matrix successfully with high probability. However, since we have 
drawn enough seeds to guarantee that every subspace has a seed with a neighborhood entirely in that subspace, we 
will find that this problem seed is redundant. This is determined in the Subspace Refinement step. 



5 Subspace Refinement 

Each of the matrix completion steps above yields a low-rank matrix with a corresponding column subspace, which 
we will call the candidate subspaces. While the true number of subspaces will not be known in advance, since 
$s_0 = O(k(\log k + \log(1/\delta_0)))$, the candidate subspaces will contain the true subspaces with high probability (see 
LemmalU. We must now deal with the algorithmic issue of determining the true set of subspaces. 

We first note that, from Assumption A3, with probability 1 a set of points of size $\geq r$ all drawn from a single 
subspace $S$ of dimension $\leq r$ will span $S$. In fact, any $b < r$ points will span a $b$-dimensional subspace of the 
$r$-dimensional subspace $S$. Therefore, if a seed's nearest 
neighborhood set is confined to a single subspace, then the columns in it span their subspace. And if the seed's nearest 
neighborhood contains columns from two or more subspaces, then the matrix will have rank larger than that of any 
of the constituent subspaces. Thus, if a certain candidate subspace is spanned by the union of two or more smaller 
candidate subspaces, then it follows that that subspace is not a true subspace (since we assume that none of the true 
subspaces are contained within another). 

Footnote: Assume that $r < n$, since otherwise it is clearly necessary to observe all entries. 

This observation suggests the following subspace refinement procedure. The $s_0$ matrix completions yield $s \leq s_0$ 
candidate column subspaces; $s$ may be less than $s_0$ since completions that fail are discarded as described above. First 
sort the estimated subspaces in order of rank from smallest to largest (with arbitrary ordering of subspaces of the 
same rank), which we write as $S_{(1)}, \ldots, S_{(s)}$. We will denote the final set of estimated subspaces as 
$\hat{S}_1, \ldots, \hat{S}_k$. The first subspace $\hat{S}_1 := S_{(1)}$, a lowest-rank subspace in the candidate set. Next, 
$\hat{S}_2 = S_{(2)}$ if and only if $S_{(2)}$ is not contained in $\hat{S}_1$. Following this simple sequential strategy, suppose 
that when we reach the candidate $S_{(j)}$ we have so far determined $\hat{S}_1, \ldots, \hat{S}_i$, $i < j$. If $S_{(j)}$ is not in 
the span of $\bigcup_{\ell=1}^{i} \hat{S}_\ell$, then we set $\hat{S}_{i+1} = S_{(j)}$; otherwise we move 
on to the next candidate. In this way, we can proceed sequentially through the rank-ordered list of candidates, and we 
will identify all true subspaces. 
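
The sequential procedure above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, each candidate is given as a matrix whose columns span it, and exact containment is replaced by a projection residual compared against a numerical tolerance `tol`. The sketch follows the text literally and tests each candidate against the span of the union of the subspaces kept so far, so it is only meaningful in toy settings where that union does not already fill the ambient space.

```python
import numpy as np

def orth_basis(A, tol=1e-8):
    """Orthonormal basis for the column span of A, computed via the SVD."""
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    if s.size == 0:
        return U[:, :0]
    return U[:, s > tol * s[0]]

def is_contained(B, basis, tol=1e-6):
    """True if span(B) lies in span(basis); both must have orthonormal columns."""
    if basis.shape[1] == 0:
        return B.shape[1] == 0
    residual = B - basis @ (basis.T @ B)
    return np.linalg.norm(residual) <= tol

def refine_subspaces(candidates, tol=1e-6):
    """Visit candidates in order of increasing rank; keep one only if it is
    not in the span of the union of the subspaces kept so far."""
    bases = sorted((orth_basis(C) for C in candidates), key=lambda B: B.shape[1])
    kept = []
    for B in bases:
        if kept:
            union = orth_basis(np.hstack(kept))
        else:
            union = np.zeros((B.shape[0], 0))
        if not is_contained(B, union, tol):
            kept.append(B)
    return kept

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A1 = rng.standard_normal((8, 2))
    A2 = rng.standard_normal((8, 2))
    # A rank-4 candidate produced by a neighborhood that mixes both subspaces.
    mixed = np.hstack([A1, A2]) @ rng.standard_normal((4, 4))
    kept = refine_subspaces([mixed, A1, A2, A1 @ rng.standard_normal((2, 2))])
    print(len(kept), [B.shape[1] for B in kept])
```

In the usage example, the duplicate of the first subspace and the mixed rank-4 candidate are both discarded, because each lies in the span of the two rank-2 subspaces kept earlier.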



6 The Full Monty 

Now all will be revealed. At this point, we have identified the true subspaces, and all $N$ columns lie in the span of 
one of those subspaces. For ease of presentation, we assume that the number of subspaces is exactly $k$. However, if 
the columns lie in the span of fewer than $k$ subspaces, then the procedure above will produce the correct number. To complete 
the full matrix, we proceed one column at a time. For each column of $X_\Omega$, we determine the correct subspace to which 
this column belongs, and we then complete the column using that subspace. We can do this with high probability due 
to results from [12, 14]. 

The first step is that of subspace assignment, determining the correct subspace to which this column belongs. 
In [14], it is shown that given $k$ subspaces, an incomplete vector can be assigned to its closest subspace with high 
probability given enough observations. In the situation at hand, we have a special case of the results of [14] because we 
are considering the more specific situation where our incomplete vector lies exactly in one of the candidate subspaces, 
and we have an upper bound for both the dimension and coherence of those subspaces. 

Lemma 8. Let $\{S_1, \ldots, S_k\}$ be a collection of $k$ subspaces of dimension $\leq r$ and coherence parameter bounded above 
by $\mu_0$. Consider column vector $x$ with index set $\Omega \subset \{1, \ldots, n\}$, and define 

$$P_{\Omega,S_j} = U^j_\Omega \left( (U^j_\Omega)^T U^j_\Omega \right)^{-1} (U^j_\Omega)^T,$$

where $U^j$ is a matrix whose orthonormal columns span $S_j$ and $U^j_\Omega$ is $U^j$ restricted to the observed rows, $\Omega$. 
Without loss of generality, suppose the column of interest $x \in S_1$. If A3 holds, and the probability of observing each 
entry of $x$ is independent and Bernoulli with parameter 

$$p_0 \geq \frac{128\,\beta \max\{\mu_1^2, \mu_0\}}{\nu_0}\,\frac{r \log^2(n)}{n},$$

then with probability at least $1 - (3(k-1) + 2)\delta_0$, 

$$\| x_\Omega - P_{\Omega,S_1} x_\Omega \|_2^2 = 0 \qquad (7)$$

and for $j = 2, \ldots, k$ 

$$\| x_\Omega - P_{\Omega,S_j} x_\Omega \|_2^2 > 0. \qquad (8)$$



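Proof. We wish to use results from [12], [14], which require a fixed number of measurements $|\Omega|$. By Chernoff's bound 

$$P\left( |\Omega| \leq \frac{n p_0}{2} \right) \leq \exp\left( -\frac{n p_0}{8} \right).$$

Note that $n p_0 > 16 r \beta \log^2 n$, therefore $\exp(-n p_0/8) < \delta_0$; in other words, we observe $|\Omega| > n p_0/2$ 
entries of $x$ with probability $1 - \delta_0$. This set $\Omega$ is selected uniformly at random among all sets of size $|\Omega|$, but using 
Lemma[T] we can assume that the samples are drawn uniformly with replacement in order to apply results of [12], [14]. 
Now we show that $|\Omega| > n p_0/2$ samples selected uniformly with replacement implies that 

$$|\Omega| > \max\left\{ \frac{8 r \mu_0}{3} \log\left( \frac{2r}{\delta_0} \right),\; \frac{r \mu_0 (1+\xi)^2}{(1-\alpha)(1-\gamma)} \right\} \qquad (9)$$

where $\xi, \alpha > 0$ and $\gamma \in (0, 1)$ are defined as $\alpha = \sqrt{\frac{2\mu_1^2}{|\Omega|} \log\left(\frac{1}{\delta_0}\right)}$, 
$\xi = \sqrt{2\mu_1 \log\left(\frac{1}{\delta_0}\right)}$, and $\gamma = \sqrt{\frac{8 r \mu_0}{3|\Omega|} \log\left(\frac{2r}{\delta_0}\right)}$. 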

We start with the second term in the max of (9). Substituting $\delta_0$ and the bound for $p_0$, one can show that for 
$n > 15$ both $\alpha < 1/2$ and $\gamma < 1/2$. This makes $\frac{(1+\xi)^2}{(1-\alpha)(1-\gamma)} < 4(1+\xi)^2 < 8\xi^2$ for $\xi > 2.5$, i.e. for 
$\delta_0 < 0.04$. 

We finish this argument by noting that $8\xi^2 = 16\mu_1 \log(1/\delta_0) < n p_0/2$; there is in fact an $O(r \log(n))$ gap 
between the two. Similarly for the first term in the max of (9), $\frac{8}{3} r \mu_0 \log\left(\frac{2r}{\delta_0}\right) < n p_0/2$; here the gap is $O(\log(n))$. 

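Now we prove (7), which follows from [12]. With $|\Omega| > \frac{8}{3} r \mu_0 \log\left(\frac{2r}{\delta_0}\right)$, we have that $U_\Omega^T U_\Omega$ is invertible with 
probability at least $1 - \delta_0$ according to Lemma 3 of [12]. This implies that 

$$U^T x = \left( U_\Omega^T U_\Omega \right)^{-1} U_\Omega^T x_\Omega. \qquad (10)$$

Call $a_1 = U^T x$. Since $x \in S$, $U a_1 = x$, and $a_1$ is in fact the unique solution to $U a = x$. Now consider the equation 
$U_\Omega a = x_\Omega$. The assumption that $U_\Omega^T U_\Omega$ is invertible implies that $a_2 = (U_\Omega^T U_\Omega)^{-1} U_\Omega^T x_\Omega$ exists and is the unique 
solution to $U_\Omega a = x_\Omega$. However, $U_\Omega a_1 = x_\Omega$ as well, meaning that $a_1 = a_2$. Thus, we have 

$$\| x_\Omega - P_{\Omega,S_1} x_\Omega \|_2^2 = \| x_\Omega - U_\Omega U^T x \|_2^2 = 0$$

with probability at least $1 - \delta_0$. 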

Now we prove (8), paralleling Theorem 1 in [14]. We use Assumption A3 to ensure that $x \notin S_j$, $j = 2, \ldots, k$. 
This along with (9) and Theorem 1 from [12] guarantees that 

$$\| x_\Omega - P_{\Omega,S_j} x_\Omega \|_2^2 \geq \frac{|\Omega|(1-\alpha) - r\mu_0 \frac{(1+\xi)^2}{1-\gamma}}{n} \, \| x - P_{S_j} x \|_2^2 > 0$$

for each $j = 2, \ldots, k$ with probability at least $1 - 3\delta_0$. With a union bound this holds simultaneously for all $k - 1$ 
alternative subspaces with probability at least $1 - 3(k-1)\delta_0$. When we also include the events that (7) holds and that 
$|\Omega| > n p_0/2$, we get that the entire lemma holds with probability at least $1 - (3(k-1) + 2)\delta_0$. □ 

Finally, denote the column to be completed by $x_\Omega$. To complete $x_\Omega$ we first determine which subspace it belongs to 
using the results above. For a given column we can use the incomplete data projection residual of (7). With probability 
at least $1 - (3(k-1) + 2)\delta_0$, the residual will be zero for the correct subspace and strictly positive for all other subspaces. 
Using the span of the chosen subspace, $U$, we can then complete the column by using $x = U (U_\Omega^T U_\Omega)^{-1} U_\Omega^T x_\Omega$. 
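
The assignment and completion steps can be sketched as follows. This is an illustrative sketch rather than the authors' code: the function names are hypothetical, each subspace is assumed to be given as an $n \times r$ matrix with orthonormal columns, and `np.linalg.lstsq` is used in place of the explicit inverse $(U_\Omega^T U_\Omega)^{-1}$, which gives the same coefficients whenever $U_\Omega$ has full column rank.

```python
import numpy as np

def residual(U, x_obs, omega):
    """Squared residual ||x_Omega - P_{Omega,S} x_Omega||^2 and the fitted coefficients."""
    U_om = U[omega, :]                      # basis restricted to the observed rows
    coef, *_ = np.linalg.lstsq(U_om, x_obs, rcond=None)
    r = x_obs - U_om @ coef
    return float(r @ r), coef

def assign_and_complete(bases, x_obs, omega):
    """Assign the column to the subspace with the smallest residual, then fill it in."""
    results = [residual(U, x_obs, omega) for U in bases]
    j = int(np.argmin([res for res, _ in results]))
    return j, bases[j] @ results[j][1]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, r, k = 30, 3, 4                      # toy sizes, not the paper's experiment
    bases = [np.linalg.qr(rng.standard_normal((n, r)))[0] for _ in range(k)]
    x = bases[2] @ rng.standard_normal(r)   # a column lying exactly in subspace 2
    omega = rng.choice(n, size=12, replace=False)
    j, x_hat = assign_and_complete(bases, x[omega], omega)
    print(j, np.allclose(x_hat, x))
```

With noiseless data the residual for the correct subspace is zero up to floating-point error, so in practice the comparison is made with the smallest residual rather than an exact zero test.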

We reiterate that Lemma 8 allows us to complete a single column $x$ with probability $1 - (3(k-1) + 2)\delta_0$. If we 
wish to complete the entire matrix, we will need another union bound over all $N$ columns, leading to a $\log N$ factor in 
our requirement on $p_0$. Since $N$ may be quite large in applications, we prefer to state our result in terms of a per-column 
completion bound. 

The confidence level stated in Theorem 2.1 is the result of applying the union bound to all the steps required in 
Sections 3, 4, and 6. All hold simultaneously with probability at least 

$$1 - (6 + 3(k-1) + 12 s_0)\,\delta_0 \geq 1 - (6 + 15 s_0)\,\delta_0,$$

which proves the theorem. 



7 Experiments 

The following experiments evaluate the performance of the proposed high-rank matrix completion procedure and 
compare results with standard low-rank matrix completion based on nuclear norm minimization. 
7.1 Numerical Simulations 



We begin by examining a highly synthesized experiment where the data exactly matches the assumptions of our high-rank 
matrix completion procedure. The key parameters were chosen as follows: $n = 100$, $N = 5000$, $k = 10$, and 
$r = 5$. The $k$ subspaces were $r$-dimensional, and each was generated by drawing $r$ vectors from the $\mathcal{N}(0, I_n)$ distribution 
and taking their span. The resulting subspaces are highly incoherent with the canonical basis for $\mathbb{R}^n$. For each 
subspace, we generate 500 points drawn from a $\mathcal{N}(0, UU^T)$ distribution, where $U$ is an $n \times r$ matrix whose orthonormal 
columns span the subspace. Our procedure was implemented using $\lceil 3k \log k \rceil$ seeds. The matrix completion software 
called GROUSE (available here ifTSl ) was used in our procedure and to implement the standard low-rank matrix 
completions. We ran 50 independent trials of our procedure and compared it to standard low-rank matrix completion. 
The results are summarized in the figures below. The key message is that our new procedure can provide accurate 
completions from far fewer observations compared to standard low-rank completion, which is precisely what our main 
result predicts. 
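
The synthetic data described above can be generated along the following lines. This is a sketch under stated assumptions: the function name and the Bernoulli observation rate `p_obs` are hypothetical (the experiment varies the number of observations per column), and the GROUSE solver used for the completions is not reproduced here.

```python
import numpy as np

def make_union_of_subspaces(n=100, k=10, r=5, per_subspace=500, p_obs=0.5, seed=0):
    """Columns drawn from k random r-dimensional subspaces of R^n, plus a random mask."""
    rng = np.random.default_rng(seed)
    columns, labels = [], []
    for j in range(k):
        # Span of r Gaussian vectors, orthonormalized with a QR factorization.
        U, _ = np.linalg.qr(rng.standard_normal((n, r)))
        # U @ z with z ~ N(0, I_r) is distributed as N(0, U U^T).
        columns.append(U @ rng.standard_normal((r, per_subspace)))
        labels += [j] * per_subspace
    X = np.hstack(columns)
    mask = rng.random(X.shape) < p_obs      # True where an entry is observed
    return X, mask, np.array(labels)

if __name__ == "__main__":
    X, mask, labels = make_union_of_subspaces()
    print(X.shape, np.linalg.matrix_rank(X), np.linalg.matrix_rank(X[:, labels == 0]))
```

The full matrix has rank $rk = 50$ while each block of 500 columns has rank 5, which is the gap between global and per-subspace rank that the experiment is designed to expose.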




Figure 2: The number of correctly completed columns (with tolerances shown above, 10e-5 or 0.01), versus the average number 
of observations per column. As expected, our procedure (termed high rank MC in the plot) provides accurate completion with only 
about 50 samples per column. Note that $r \log n \approx 23$ in this simulation, so this is quite close to our bound. On the other hand, since 
the rank of the full matrix is $rk = 50$, the standard low-rank matrix completion bound requires $m > 50 \log n \approx 230$. Therefore, it 
is not surprising that the standard method (termed low rank MC above) requires almost all samples in each column. 



7.2 Network Topology Inference Experiments 

The ability to recover Internet router-level connectivity is of importance to network managers, network operators and 
the area of security. As a complement to the heavy network load of standard active probing methods (e.g., ifTSl), which 
scale poorly for Internet-scale networks, recent research has focused on the ability to recover Internet connectivity from 
passively observed measurements ifTTl . Using a passive scheme, no additional probes are sent through the network; 
instead we place passive monitors on network links to observe "hop-counts" in the Internet (i.e., the number of routers 
between two Internet resources) from traffic that naturally traverses the link the monitor is placed on. An example of 
this measurement infrastructure can be seen in Figure 3. 

These hop count observations result in an $n \times N$ matrix, where $n$ is the number of passive monitors and $N$ is the 
total number of unique IP addresses observed. Due to the passive nature of these observations, specifically the requirement that 
we only observe traffic that happens to be traversing the link where a monitor is located, this hop count matrix will 
be massively incomplete. A common goal is to impute (or fill in) the missing components of this hop count matrix in 
order to infer network characteristics. 

Prior work on analyzing passively observed hop matrices has found a distinct subspace mixture structure [9], 
where the full hop count matrix, while globally high rank, is generated from a series of low rank subcomponents. 
These low rank subcomponents are the result of the Internet topology structure, where all IP addresses in a common 
subnet exist behind a single common border router. This network structure is such that any probe sent from an IP in a 
particular subnet to a monitor must traverse through the same border router. A result of this structure is a rank-two hop 
count matrix for all IP addresses in that subnet, consisting of the hop count vector to the border router and a constant 
offset relating to the distance from each IP address to the border router. Using this insight, we apply the high-rank 
matrix completion approach on incomplete hop count matrices. 

Figure 3: Internet topology example of subnets sending traffic to passive monitors through the Internet core and common border 
routers. 
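
The rank-two structure of a single subnet can be illustrated with a toy computation. This is a hypothetical sketch: the function name and the hop values are made up for illustration, and the model simply adds each host's distance to its border router to that router's hop-count vector, as described in the text.

```python
import numpy as np

def subnet_hop_matrix(router_to_monitor, host_to_router):
    """Hop counts from each host in one subnet (columns) to each monitor (rows).

    Column j equals the border router's hop-count vector plus a constant offset,
    namely the distance from host j to the border router.
    """
    b = np.asarray(router_to_monitor, dtype=float)
    c = np.asarray(host_to_router, dtype=float)
    return b[:, None] + c[None, :]

if __name__ == "__main__":
    # Made-up hop counts: 75 monitors and 40 hosts behind one border router.
    b = 3.0 + (np.arange(75) % 12)
    c = 1.0 + (np.arange(40) % 5)
    H = subnet_hop_matrix(b, c)
    print(H.shape, np.linalg.matrix_rank(H))
```

Every column lies in the span of the router's hop-count vector and the all-ones vector, so the block for one subnet has rank at most two no matter how many hosts it contains.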

Using a Heuristically Optimal Topology from ifTSl , we simulated a network topology and measurement infrastructure 
consisting of $N = 2700$ total IP addresses uniformly distributed over $k = 12$ different subnets. The hop counts 
are generated on the topology using shortest-path routing from $n = 75$ passive monitors located randomly throughout 
the network. As stated above, each subnet corresponds to a subspace of dimension $r = 2$. Observing only 40% of the 
total hop counts, in Figure 4 we present the results of the hop count matrix completion experiments, comparing the 
performance of the high-rank procedure with standard low-rank matrix completion. The experiment shows dramatic 
improvements, as over 70% of the missing hop counts can be imputed exactly using the high-rank matrix completion 
methodology, while almost none of the missing elements are imputed exactly using standard low-rank matrix completion. 




Figure 4: Hop count imputation results, using a synthetic network with $k = 12$ subnets, $n = 75$ passive monitors, and $N = 2700$ 
IP addresses. The cumulative distribution of estimation error is shown with respect to observing 40% of the total elements. 

Finally, using real-world Internet delay measurements (courtesy of [19]) from $n = 100$ monitors to $N = 22550$ 
IP addresses, we test imputation performance when the underlying subnet structure is not known. Using the estimate 
$k = 15$, in Figure 5 we find a significant performance increase using the high-rank matrix completion technique. 